Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • creating a random*sample for panel data - but should represent several countries

    Hello,

    I have a huge data file of unbalanced panel data( around 30 million ID-year observations).
    I want to create a random sample but keep all years' observations for a particular company ID.
    The main data file includes companies from 13 countries (variable: country_code).

    How can I create a random sample (of 100,000 different IDs) but keep all years' observations of a particular ID, and the sample should represent all 13 countries? Do I need to sort the data first?

    Code:
    sort ID year
    How the code would be different if I want a % based sample, for example, a sample of 10% of the full data, representing the 13 countries

    here's an example of the variables that are important for the sample

    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str18 id float year str2 country_code
    "IT00832220156" 1999 "IT"
    "GB01180514" 2018 "GB"
    "SE5560525692" 2020 "SE"
    "FR826680084" 2007 "FR"
    "IT00180310781" 2015 "IT"
    "ESA38403028" 2019 "ES"
    "FR418872081" 2006 "FR"
    "ESB04105847" 2011 "ES"
    "BE0819252201" 2010 "BE"
    "SE5562265917" 2016 "SE"
    "FR378281935" 2000 "FR"
    "ESB81518623" 2009 "ES"
    "ESB10343051" 2017 "ES"
    "ESB75041228" 2015 "ES"
    "IT02564770903" 2019 "IT"
    "FI22585062" 2019 "FI"
    "IT10607171005" 2016 "IT"
    "IT01817990276" 2015 "IT"
    "FR381004415" 2011 "FR"
    "IT01861600342" 1998 "IT"
    "FI17091885" 2012 "FI"
    "IT01777250976" 2008 "IT"
    "ESB31777808" 2009 "ES"
    "SE5569792962" 2020 "SE"
    "IT09727470966" 2021 "IT"
    "IT03005730738" 2018 "IT"
    "SE5563133395" 2011 "SE"
    "DK31170427" 2021 "DK"
    "FR338466295" 1998 "FR"
    "ESB97281521" 2019 "ES"
    "ESB15665748" 2005 "ES"
    "SE5564657269" 2013 "SE"
    "IT01930040561" 2013 "IT"
    "ESB04251955" 2013 "ES"
    "ESB14361810" 2003 "ES"
    "ESB12742821" 2011 "ES"
    end
    [/CODE]


    thank you for your help

  • #2
    Code:
    frame put id, into(ids)
    frame ids {
        duplicates drop
        set seed 1234 // OR WHATEVER SEED NUMBER YOU LIKE
        gen double shuffle1 = runiform()
        gen double shuffle2 = runiform()
        by shuffle1 shuffle2, sort: keep if _n <= 100000
    }
    frlink m:1 id, frame(ids)
    keep if !missing(ids)
    drop ids
    frame drop ids

    Comment


    • #3
      Hello Clyde,

      Thank you for the provided code, but unfortunately, didn't work with me.
      I received an error in Stata
      command frame is unrecognized
      , I think the command
      Code:
      frame
      doesn't work in Stata 14

      I also noticed that the code doesn't include the variable country_code. because the number of companies in each country is very different, some countries such as Germany and UK would have much more companies than other countries in my sample. Is it possible a random sample ,but take into account the unequal number of companies in different countries?

      Thank you for your help

      Comment


      • #4
        All Forum members are asked to read the Forum FAQ before posting. Had you done so, you would have read that if you are not using the current version of Stata (18) you are supposed to state what version you are using. That would have prevented both of us from wasting our time--me writing, and you trying to execute, code that is not possible for you.

        You raise two issues. The issue of Stata 14 not supporting -frame- commands has an easy workaround:
        Code:
        preserve
        keep id
        duplicates drop
        set seed 1234
        gen double shuffle1 = runiform()
        gen double shuffle2 = runiform()
        sort shuffle1 shuffle2
        keep if _n <= 100000
        drop shuffle*
        tempfile selections
        save `selections'
        
        restore
        merge m:1 id using `selections', keep(match) nogenerate
        The other issue, that some small countries might fail to be represented, is not so simple. You are, as I interpret it, looking to oversample small countries to assure that they are represented in the random sample. That can be done, but the only other constraint you have proposed is that the total sample size be 100,000. That constraint fails to narrow down the problem sufficiently. You will need to say how you want to upweight the small countries. One possibility is to say that you want to sample some fixed percentage of the id's in each country. Since you are looking for a sample size of around 100,000 and you are starting with 30,000,000, that means you will want, on average, about 1 in each 300 observations, or 0.33%. So you could redefine the problem as choosing a 1 in 300 random sample, stratifying on each country, with a limitation that at least one id must be chosen from each country. The exact sample size may not turn out to be 100,000, but it won't be far off, and it will guarantee each country's inclusion.

        That would look like this:
        Code:
        preserve
        keep country id
        duplicates drop
        set seed 1234
        gen double shuffle1 = runiform()
        gen double shuffle2 = runiform()
        by country (shuffle1 shuffle2), sort: keep if _n <= ceil(_N/300)
        drop shuffle*
        tempfile selections
        save `selections'
        
        restore
        merge m:1 id using `selections', keep(match) nogenerate
        If this approach doesn't work for you, post back with more specific sampling criteria.

        In any case, bear in mind that by oversampling the small countries, you will need to then downweight them in the analysis if your research questions involve obtaining results that are sensitive to between-country differences.

        Comment


        • #5
          Dear Clyde Schechter ,
          Apologies for not mentioning the Stata version, won't happen again.

          Thank you for the code provided and the explanation. I used the second code and it worked until the last step (merging)

          when using this code
          Code:
           
           merge m:1 id using `selections', keep(match) nogenerate
          I received the error message
          variable id does not uniquely identify observations in the using data
          which i find very weird, how there are not uniquely identified "id" id we create the sample from this file originally.

          I also tried saving the file not as a temp file , and used this
          Code:
          clear
          use 'original file.dta"
          merge m:1 using "selections" ,keep(match) nongenerate
          and also received the same error.
          what could possibly go wrong? can't really know

          I appreciate your help

          Many thanks

          Comment


          • #6
            I figured that there were duplicates in the 'selections' file, don't know why, dropped the duplicates then the merge worked.

            thanks

            Comment


            • #7
              Yes, there were duplicates in the `selections' file, but dropping them is not the correct solution. You got a result, but if it is correct, it is just luck. My code in #4 contained an error. The -merge- command should have been:
              Code:
              merge m:1 country id using `selections', keep(match) nogenerate
              I apologize for the error.

              Comment

              Working...
              X